Fix firstOnly selection behavior #152

Merged (3 commits) on Jun 20, 2025

Conversation

jrom99
Contributor

@jrom99 jrom99 commented Sep 12, 2024

`firstOnly` used to select one match per combination (object+chain+segi), so if one object had multiple chains, each chain would match once.

This makes it so that `firstOnly` will only match one time per object, on the first segi+chain available (in alphabetical order).
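
A minimal sketch of the difference, in plain Python with no PyMOL dependency. The tuple layout and the two function names are hypothetical and exist only for illustration; this is not the script's actual code.

```python
from itertools import groupby

# Hypothetical hits: (object, segi, chain, match_start), one tuple per hit.
hits = [
    ("1abc", "", "B", 10),
    ("1abc", "", "A", 42),
    ("1abc", "", "A", 7),
    ("2xyz", "", "C", 3),
]

def first_per_combination(hits):
    """Old behavior: keep the first hit of every (object, segi, chain)."""
    ordered = sorted(hits)
    return [next(group) for _, group in groupby(ordered, key=lambda h: h[:3])]

def first_per_object(hits):
    """New behavior: keep one hit per object, taken from the
    alphabetically first (segi, chain) that has a match."""
    ordered = sorted(hits)
    return [next(group) for _, group in groupby(ordered, key=lambda h: h[0])]

old = first_per_combination(hits)  # chains A and B of 1abc both contribute
new = first_per_object(hits)       # only chain A of 1abc contributes
```

With these made-up hits, the old grouping keeps three entries (two for `1abc`), while the new grouping keeps two, one per object.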
@jrom99
Contributor Author

jrom99 commented Sep 12, 2024

Another issue I've noticed: when an object has sequence data for residues that lack structural information (like in loops), this script is unable to find the sequence to select.

But I'm not sure how to update the selection behavior to fix it.

@pslacerda
Member

I'll check the firstOnly issue soon. Can you post a script here that exercises the findseq use case?

About your missing-residue issue, I have an idea that seems to work. The sequence data is certainly available in RCSB PDB and mmCIF files, but it may be missing in files from other sources; I don't know exactly when that is the case.

The only API command related to findseq is cmd.get_fastastr. Could you check whether it works in your case? In my case the FASTA string is retrieved complete, but the output of iterate and cmd.get_model is not, and I don't know why.

https://github.com/schrodinger/pymol-open-source/blob/9d3061ca58d8b69d7dad74a68fc13fe81af0ff8e/modules/pymol/exporting.py#L169

@pslacerda
Member

pslacerda commented Sep 12, 2024

The ONE_LETTER table has some errors, such as the mapping 'CRF':'TWG', which will ruin the analysis if that entry is ever matched. There are also entries like 'A ':'A', which are not useful.

Edit: I checked some values in ONE_LETTER and I don't trust it.
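
One plausible reading of why a multi-letter entry is harmful, sketched below: if the searched string is built by concatenating one lookup value per residue, a three-letter value shifts every later string position away from its residue position. Only the 'CRF' entry comes from the discussion; the residue list and variable names are made up, and this is an assumption about the failure mode, not a description of the script's internals.

```python
# Hypothetical excerpt of a residue-name lookup; only the 'CRF' entry
# is taken from the discussion above.
lookup = {"ALA": "A", "GLY": "G", "SER": "S", "CRF": "TWG"}

# Made-up residue list: (resi, resn).
residues = [(1, "ALA"), (2, "CRF"), (3, "GLY"), (4, "SER")]

# One lookup value per residue, concatenated into the searched string.
seq = "".join(lookup[resn] for _, resn in residues)

# Where "GS" starts in the string vs. where GLY sits in the residue list.
string_index = seq.find("GS")
residue_index = [resn for _, resn in residues].index("GLY")
```

Here the string index and the residue index disagree, so any code that maps a match offset back to a residue by position would select the wrong residues.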

@pslacerda
Member

I reverted your commits because I tested them a few hours ago and they weren't working. I shouldn't have merged them without testing.

You can recover your commits from this PR branch if you need them. Take your time...

Remove dependency on hardcoded ONE_LETTER dictionary
@jrom99
Contributor Author

jrom99 commented Jun 19, 2025

cmd.get_fastastr gave the wrong FASTA string for my test file (1qys), returning a sequence with "?" instead of "M". However, cmd.iterate was able to retrieve the residues without structural information, so now we can find them as well. I took the liberty of renaming firstOnly to matchMode to reflect the behavior for multiple objects.

It worked as expected on my files, but I'd need your files and tests to confirm it works for you too.

@pslacerda
Member

pslacerda commented Jun 19, 2025

Note that residues without structural information can be retrieved only when the file carries the appropriate metadata. In custom-made files they will be skipped, I guess. If pertinent, this should be handled in code or stated explicitly in the documentation.

@jrom99
Contributor Author

jrom99 commented Jun 20, 2025

> Note that residues without structural information can be retrieved only when the file carries the appropriate metadata. In custom-made files they will be skipped, I guess. If pertinent, this should be handled in code or stated explicitly in the documentation.

Do you have one of these so I can test the code?

@pslacerda
Member

I tested your branch patch-1. It works, but it has merge conflicts. Can you force push or something?

@jrom99
Contributor Author

jrom99 commented Jun 20, 2025

I don't know how to force push, so I resolved the conflict by editing it.

Another thing, which I'm not sure we should document in the help text: this code (and the original version as well) only searches for non-overlapping matches. In a future update, how about adding an option to create a group and put each match into its own selection, which would allow for overlapping matches?
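
To illustrate what non-overlapping means here, a small sketch using Python's `re` module on a made-up sequence. It shows the general behavior of `re.finditer`, plus one common way to recover overlapping hits with a zero-width lookahead; it is not taken from the findseq code.

```python
import re

seq = "AAAA"  # made-up sequence

# finditer consumes each match, so scanning resumes after the match ends.
non_overlapping = [m.start() for m in re.finditer(r"AA", seq)]

# Wrapping the pattern in a lookahead makes every match zero-width,
# so the scan advances one position at a time and overlaps are reported.
overlapping = [m.start() for m in re.finditer(r"(?=(AA))", seq)]
```

The plain pattern reports starts 0 and 2, while the lookahead variant also reports the overlapping hit at 1.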

@pslacerda pslacerda merged commit 0251a56 into Pymol-Scripts:master Jun 20, 2025
@pslacerda
Member

I don't understand this regex in the examples. Is it multi-character? How does it work?

        # find the Potential N-linked glycosylation sites in 5fyj
        fetch 5fyj
        findseq N(?=[^P][ST]), 5fyj and chain G+B, 5fyj_pngs

@jrom99
Contributor Author

jrom99 commented Jun 21, 2025

> I don't understand this regex in the examples. Is it multi-character? How does it work?
>
>         # find the Potential N-linked glycosylation sites in 5fyj
>         fetch 5fyj
>         findseq N(?=[^P][ST]), 5fyj and chain G+B, 5fyj_pngs

It looks for a single character. This regex uses a lookahead assertion to match only N's that are followed by XY, where X is not P and Y is either S or T.

From Wikipedia:

> On the other hand, the attachment of a glycan residue to a protein requires the recognition of a consensus sequence. N-linked glycans are almost always attached to the nitrogen atom of an asparagine (Asn) side chain that is present as a part of Asn–X–Ser/Thr consensus sequence, where X is any amino acid except proline (Pro).
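
The same pattern can be tried outside PyMOL with Python's `re` module. The sequence below is made up for illustration; only the regex comes from the example above.

```python
import re

# N followed by any residue except P, then S or T. The lookahead (?=...)
# checks the next two residues without consuming them, so each match is
# the single character N.
PNGS = re.compile(r"N(?=[^P][ST])")

# Made-up sequence, not taken from 5fyj.
seq = "ANGSANPTNATNNSS"

matches = [(m.start(), m.group()) for m in PNGS.finditer(seq)]
```

The N at index 5 is skipped because it is followed by P. The adjacent N's at indices 11 and 12 both match, since the lookahead does not consume the residues it inspects.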
